Abstract

In many real world problems, control decisions have to be made 
with limited information. The controller may have no a priori (or even a
posteriori) data on the nonlinear system, except for a limited number
of points that are obtained over time. This is either due to high cost 
of observation or the highly non-stationary nature of the system. The 
resulting conflict between information collection (identification, explo- 
ration) and control (optimization, exploitation) necessitates an active 
learning approach for iteratively selecting the control actions which 
concurrently provide the data points for system identification. This 
paper presents a dual control approach where the information acquired 
at each control step is quantified using the entropy measure from in- 
formation theory and serves as the training input to a state-of-the-art 
Gaussian process regression (Bayesian learning) method. The explicit 
quantification of the information obtained from each data point allows 
for iterative optimization of both identification and control objectives. 
The approach developed is illustrated with two examples: control of
the logistic map as a chaotic system and position control of a cart with
an inverted pendulum.



1 Introduction 

In many real world problems, control decisions have to be made with limited 
information. Obtaining extensive and accurate information about the con- 
trolled system can often be a costly and time consuming process. In some 
cases, acquiring detailed information on system characteristics may be sim- 
ply infeasible due to high observation costs. In others, the observed system 
may be so nonstationary that by the time the information is obtained, it is 
already outdated due to the system's fast-changing nature. Therefore, the only
option left to the controller is to develop a strategy for collecting informa- 
tion efficiently and choose a model to estimate the "missing portions" of the 
system in order to control it according to a given objective. 

A variant of this problem has been well known in the control literature
since the 1960s as dual control. The underlying concept in dual control is
obtaining good process information through perturbation while controlling the
process. The controller necessarily has dual goals. First, the controller
must control the process as well as possible. Second, the controller must
inject a probing signal or perturbation to get more information about the
process. By gaining
more process information better control can be achieved in the future [,.20j. 

The problem considered here differs from the classical dual control prob- 
lem in the very limited amount of information available to the controller. 
The controller here cannot aim to identify the system first to obtain better 
performance in the future due to non-stationarity and/or prohibitive ob- 
servation costs. Furthermore, the perturbation idea is not fully applicable 
since each action-observation pair provides a single data point for identify- 
ing the nonlinear discrete-time system, unlike in the identification of (linear) 
continuous-time systems. 

This paper approaches the "dual control" problem from a Bayesian per- 
spective. Gaussian processes (GP) are utilized as a state-of-the-art regres- 
sion (function estimation) method for identifying the underlying state-space 
equations of the discrete-time nonlinear system from observed (training) 
data. More importantly, the adopted GP (Bayesian) framework allows ex- 
plicit quantification of information, which each observed data point provides 
within the a-priori chosen model. Hence, the information collection goal can 
be explicitly combined with the control objectives and posed as a (weighted- 
sum, multi-objective) optimization problem based on one (or multi-) step 
lookahead. This results in a joint and iterative scheme of active learning 
and control. 

The proposed approach consists of three main parts: observation, update 
of GP for regression, and optimization to determine the next control action. 
These three steps, shown in Figure 1, are taken iteratively to achieve the
dual objectives of identification and control. 

Observations, given that they are a scarce resource in the class of prob- 
lems considered, play an important role in this approach. Uncertainties in 
the observed quantities can be modeled as additive noise. Likewise, proper- 
ties (variance or bias) of additive noise can be used to model the reliability 
of (and bias in) the data points observed. GPs provide a straightforward
mathematical structure for incorporating these aspects into the model under
some simplifying assumptions.

Figure 1: The underlying model of the dual control approach.

The set of observations collected provides the (supervised) training data
for GP regression in order to estimate the characteristics of the function or
system at hand. This process relies on the GP methods, which will be
described in Subsection 2.1. Thus, at each iteration an up-to-date description
of the function or system is obtained based on the latest observations. 

The final step of the approach provides a basis for determining the next 
control action based on an optimization process that takes into account dual 
objectives. The information measurement aspect of these objectives will be 
discussed in Subsection 2.2. An important issue here is the fact that there
are infinitely many candidate points in this optimization process, but in 
practice only a finite collection of them can be evaluated. 

The investigated approach incorporates many concepts that have been 
implicitly considered by heuristic schemes, and builds upon results from 
seemingly disjoint but relevant fields such as information theory, machine 
learning, optimization, and control theory. Specifically, it combines concepts 
from these fields by 

• explicitly quantifying the information acquired using the entropy mea- 
sure from information theory, 

• modeling and estimating the (nonlinear) controlled system adopting a 
Bayesian approach and using Gaussian processes as a state-of-the-art 
regression method, 

• using an iterative scheme for observation, learning, and control,

• capturing all of these aspects under the umbrella of a multi-objective
"meta" optimization and control formulation. 

Although methods and approaches from machine (statistical) learning are
heavily utilized in this framework, the problem at hand is very different
from many classical machine learning ones, even in its learning aspect. In
most classical application domains of machine learning such as data mining,
computer vision, or image and voice recognition, the difficulty is often in
handling a significant amount of data rather than a lack of it. Many methods
such as Expectation-Maximization (EM) inherently make this assumption,
except for "active learning" schemes [3]. Information theory plays
an important role in evaluating scarce (and expensive) data and developing
strategies for obtaining it. Interestingly, data scarcity at the same time
turns the disadvantages of some methods into advantages, e.g. the scalability
problem of Gaussian processes.

It is worth noting that the class of problems described here is encountered
much more frequently in practice than it may first seem. Social systems
and economics, where information is scarce and systems are very
nonstationary by nature, constitute an important application domain. The
control framework proposed is further applicable to a wide variety of fields
due to its fundamentally adaptive nature. One example is decentralized resource
allocation decisions in networked and complex systems, e.g. wired and
wireless networks, where parameters change quickly and global information on
network characteristics is not available at the local decision-making nodes.
Another example is security and information technology risk management 
in large-scale organizations, where acquiring information on individual sub- 
systems and processes can be very costly. Yet another example application 
is in biological systems where individual organisms or subsystems operate 
autonomously (even if they are part of a larger system) under limited local 
information. 

2 Methodology 

This section summarizes the results in [2] and presents the underlying meth- 
ods that are utilized within the dual control framework. First, the regression 
model and Gaussian Processes (GP) are presented. Subsequently, modeling 
and measurement of information is discussed using (Shannon) information 
theory. 

2.1 Regression and Gaussian Processes (GP)



The system identification problem here involves inferring the nonlinear
function(s) $f$ in the state-space equations describing the system using the
set of observed data points. This is known as regression in the machine
learning literature, which is a supervised learning method since the data
observed here is at the same time the training data. This learning process
involves selection of a "model", where the learned function $\hat{f}$ is, for
example, expressed in terms of a set of parameters and specific basis
functions. Gaussian processes (GP) provide a nonparametric alternative to this
but follow in spirit the same idea.

The main goal of regression involves a trade-off. On the one hand, it
tries to minimize the observed error between $f$ and $\hat{f}$. On the other,
it tries to infer the "real" shape of $f$ and make good estimates using
$\hat{f}$ even at unobserved points (generalization). If the former is overly
emphasized, then one ends up with "overfitting", which means $\hat{f}$ follows
$f$ closely at observed points but has weak predictive value at unobserved
ones. This delicate balance is usually achieved by balancing the prior
"beliefs" on the nature of the function, captured by the model (basis
functions), and fitting the model to the observed data.

This paper focuses on Gaussian processes [11] as the chosen regression
method within the proposed dual control approach without loss of any
generality. There are multiple reasons behind this preference. Firstly, GP
provides an elegant mathematical method for easily combining many aspects of
the approach. Secondly, being a nonparametric method, GP eliminates any
discussion on model degree. Thirdly, it is easy to implement and understand
as it is based on well-known Gaussian probability concepts. Fourthly, noise
in observations is immediately taken into account if it is modeled as
Gaussian. Finally, one of the main drawbacks of GP, namely being
computationally heavy, does not really apply to the problem at hand since the
amount of data available is already very limited.

It is not possible to present here a comprehensive treatment of GP.
Therefore, a very rudimentary overview is provided next within the context of
the control problem. Consider a set of $M$ data points

$$ \mathcal{D} = \{x_1, \ldots, x_M\}, $$

where each $x_i \in \mathcal{X}$ is a $d$-dimensional vector, and the
corresponding vector of scalar values is $f(x_i)$, $i = 1, \ldots, M$. Assume
that the observations are distorted by zero-mean Gaussian noise
$n \sim \mathcal{N}(0, \sigma)$ with variance $\sigma$. Then, the resulting
observations form a Gaussian vector
$y = f(x) + n \sim \mathcal{N}(f(x), \sigma)$.


A GP is formally defined as a collection of random variables, any finite
number of which have a joint Gaussian distribution [11]. It is completely
specified by its mean function $m(x)$ and covariance function
$C(x, \tilde{x})$, where

$$ m(x) = E[f(x)] $$

and

$$ C(x, \tilde{x}) = E[(f(x) - m(x))(f(\tilde{x}) - m(\tilde{x}))], \quad \forall x, \tilde{x} \in \mathcal{X}. $$

Let us for simplicity choose $m(x) = 0$. Then, the GP is characterized
entirely by its covariance function $C(x, \tilde{x})$. Since the noise in the
observation vector $y$ is also Gaussian, the covariance function can be
defined as the sum of a kernel function $Q(x, \tilde{x})$ and the diagonal
noise variance

$$ C(x, \tilde{x}) = Q(x, \tilde{x}) + \sigma I, \quad \forall x, \tilde{x} \in \mathcal{D}, \qquad (2.1) $$

where $I$ is the identity matrix. While it is possible to choose here any
(positive definite) kernel $Q(\cdot, \cdot)$, one classical choice is

$$ Q(x, \tilde{x}) = \exp\left[ -\frac{1}{2} \| x - \tilde{x} \|^2 \right]. \qquad (2.2) $$



Note that GP makes use of the well-known kernel trick here by representing 
an infinite dimensional continuous function using a (finite) set of continuous 
basis functions and associated vector of real parameters in accordance with 
the representer theorem [12]. 

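The (noisy)¹ training set $(\mathcal{D}, y)$ is used to define the
corresponding GP, $\mathcal{GP}(0, C(\mathcal{D}))$, through the $M \times M$
covariance function $C(\mathcal{D}) = Q + \sigma I$, where the conditional
Gaussian distribution of the value $\bar{y}$ at any point outside the training
set, $x \in \mathcal{X}$, $x \notin \mathcal{D}$, given the training data
$(\mathcal{D}, y)$ can be computed as follows. Define the vector

$$ k(x) = [Q(x_1, x), \ldots, Q(x_M, x)] \qquad (2.3) $$

and scalar

$$ \kappa = Q(x, x) + \sigma. \qquad (2.4) $$

Then, the conditional distribution $p(\bar{y} \mid y)$ that characterizes the
$\mathcal{GP}(0, C)$ is a Gaussian $\mathcal{N}(\hat{f}, v)$ with mean
$\hat{f}$ and variance $v$,

$$ \hat{f}(x) = k^T C^{-1} y \quad \text{and} \quad v(x) = \kappa - k^T C^{-1} k. \qquad (2.5) $$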



¹ The special case of perfect observation without noise is handled the same
way as long as the kernel function $Q(\cdot, \cdot)$ is positive definite.

This is a key result that defines GP regression as the mean function
$\hat{f}(x)$ of the Gaussian distribution and provides a prediction of the
function $f(x)$. At the same time, it belongs to the well-defined class
$\hat{f} \in \mathcal{F}$, which is the set of all possible sample functions
of the GP

$$ \mathcal{F} := \{ f(x) : \mathcal{X} \to \mathbb{R} \text{ such that } f \in \mathcal{GP}(0, C(\mathcal{D})), \ \forall \mathcal{D}, C \}, $$

where $C(\mathcal{D})$ is defined in (2.1) and $\mathcal{GP}$ through (2.3),
(2.4), and (2.5) above. Furthermore, the variance function $v(x)$ can be used
to measure the uncertainty level of the predictions provided by $\hat{f}$,
which will be discussed in the next subsection.
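
The prediction step in (2.5) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in the paper: the function names (`kernel`, `solve`, `gp_predict`), the three data points, and the noise variance 0.01 are all assumptions made for the example, and the linear systems are solved with a small hand-written Gaussian elimination instead of a numerical library.

```python
import math

def kernel(x, z):
    """Squared-exponential kernel Q(x, z) = exp(-0.5 * ||x - z||^2), as in (2.2)."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, z)))

def solve(A, b):
    """Solve A s = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    s = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s[i] = (M[i][n] - sum(M[i][c] * s[c] for c in range(i + 1, n))) / M[i][i]
    return s

def gp_predict(X, y, x_new, noise):
    """Return the predictive mean and variance of (2.5) at the point x_new."""
    m = len(X)
    # C = Q + noise * I, the covariance of the noisy training outputs
    C = [[kernel(X[i], X[j]) + (noise if i == j else 0.0) for j in range(m)]
         for i in range(m)]
    k = [kernel(xi, x_new) for xi in X]        # vector k(x) of (2.3)
    kappa = kernel(x_new, x_new) + noise       # scalar kappa of (2.4)
    mean = sum(ki * ai for ki, ai in zip(k, solve(C, y)))          # k^T C^{-1} y
    var = kappa - sum(ki * bi for ki, bi in zip(k, solve(C, k)))   # kappa - k^T C^{-1} k
    return mean, var

# Hypothetical one-dimensional data: three noisy samples of an unknown function.
X = [(0.0,), (1.0,), (2.0,)]
y = [0.0, 0.9, 0.1]
mean_near, var_near = gp_predict(X, y, (1.0,), noise=0.01)   # at an observed input
mean_far, var_far = gp_predict(X, y, (10.0,), noise=0.01)    # far from all data
```

At an observed input the variance is small and the mean stays close to the observed value, whereas far from the data the prediction falls back to the zero prior mean with variance close to $\kappa$.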

2.2 Quantifying the Information in Observations 

Each observation provides a data point to the regression problem (estimating
$f$ by constructing $\hat{f}$) as discussed in the previous subsection. Active
learning addresses the question of "how to quantify information obtained 
and optimize the observation process?". Following the approach discussed 
in [HI [To], the approach here provides a precise answer to this question. 

Making any decision on the next (set of) observations in a principled 
manner necessitates first measuring the information obtained from each ob- 
servation within the adopted model. It is important to note that the infor- 
mation measure here is dependent on the chosen model. For example, the 
same observation provides a different amount of information to a random 
search model than a GP one. 

Shannon information theory readily provides the necessary mathematical
framework for measuring the information content of a variable. Let
$p$ be a probability distribution over the set of possible values of a
discrete random variable $A$. The entropy of the random variable is given by
$H(A) = \sum_i p_i \log_2(1/p_i)$, which quantifies the amount of uncertainty.
Then, the information obtained from an observation on the variable, i.e. the
reduction in uncertainty, can be quantified simply by taking the difference of
its initial and final entropy,

$$ \mathcal{I} = H_0 - H_1. $$

It is important here to avoid the common conceptual pitfall of equating
entropy to information itself, as is sometimes done in the communication
theory literature. Since this issue is not of great importance for the class
of problems considered in communication theory, it is often ignored. However,
the difference is of conceptual importance in this problem.² In this
case, (Shannon) information is defined as a measure of the decrease of
uncertainty after (each) observation (within a given model).

² See http://www.ccrnp.ncifcrf.gov/~toms/information.is.not.uncertainty.html
for a detailed discussion.
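
The distinction between entropy and information can be illustrated with a small numerical sketch. The four-valued variable and the two distributions below are hypothetical, chosen only so that the numbers are easy to check by hand.

```python
import math

def entropy_bits(p):
    """Shannon entropy H = sum_i p_i * log2(1 / p_i); zero-probability terms add 0."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0.0)

# Hypothetical variable with four equally likely values before the observation.
prior = [0.25, 0.25, 0.25, 0.25]
# After an observation that rules out two of the four values.
posterior = [0.5, 0.5, 0.0, 0.0]

h0 = entropy_bits(prior)       # initial uncertainty: 2 bits
h1 = entropy_bits(posterior)   # remaining uncertainty: 1 bit
information = h0 - h1          # information gained: the decrease in uncertainty
```

Here the remaining entropy is 1 bit, and the information obtained from the observation is also 1 bit, but only because the initial entropy was 2 bits; the information is the difference, not the entropy itself.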

To apply this idea to GP, let the zero-mean multivariate Gaussian (normal)
probability distribution be denoted as

$$ p(x) = \frac{1}{\sqrt{(2\pi)^d \, |C_p(x)|}} \exp\left( -\frac{1}{2} [x - m]^T C_p(x)^{-1} [x - m] \right), \qquad (2.6) $$

where $x \in \mathcal{X}$, $|\cdot|$ is the determinant, $m$ is the mean
(vector) as defined in (2.5), and $C_p(x)$ is the covariance matrix as a
function of the newly observed point $x \in \mathcal{X}$ given by

$$ C_p(x) = \begin{bmatrix} C(\mathcal{D}) & k(x) \\ k(x)^T & \kappa \end{bmatrix}. \qquad (2.7) $$

Here, the vector $k(x)$ is defined in (2.3) and $\kappa$ in (2.4),
respectively. The matrix $C(\mathcal{D})$ is the covariance matrix based on
the training data $\mathcal{D}$ as defined in (2.1).

The entropy of the multivariate Gaussian distribution (2.6) is [1]

$$ H(x) = \frac{d}{2} + \frac{d}{2} \ln(2\pi) + \frac{1}{2} \ln |C_p(x)|, $$

where $d$ is the dimension. Note that this is the entropy of the GP estimate
at the point $x$ based on the available data $\mathcal{D}$. The aggregate
entropy of the function on the region $\mathcal{X}$ is given by

$$ H^{agg} := \int_{x \in \mathcal{X}} \frac{1}{2} \ln |C_p(x)| \, dx. \qquad (2.8) $$
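
The link between the entropy at a point and the GP variance in (2.5) can be made explicit. The identity below is the standard block-determinant (Schur complement) formula, not a result stated in the text; it assumes that $C(\mathcal{D})$ is invertible and uses the block structure of $C_p(x)$ in (2.7).

```latex
|C_p(x)| \;=\; |C(\mathcal{D})| \,\bigl( \kappa - k^T C(\mathcal{D})^{-1} k \bigr)
         \;=\; |C(\mathcal{D})| \, v(x)
\quad\Longrightarrow\quad
\tfrac{1}{2} \ln |C_p(x)| \;=\; \tfrac{1}{2} \ln |C(\mathcal{D})| + \tfrac{1}{2} \ln v(x).
```

Since the first term on the right does not depend on $x$, the entropy of the estimate at $x$ increases monotonically with the predictive variance $v(x)$.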

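The problem of choosing a new data point $\tilde{x}$ such that the
information obtained from it within the GP regression model is maximized can
be formulated as:

$$ \tilde{x}^* = \arg\max_{\tilde{x}} \mathcal{I} = \arg\max_{\tilde{x}} \int [H_0 - H_1] \, dx = \arg\min_{\tilde{x}} \int_{x \in \mathcal{X}} \frac{1}{2} \ln |C_q(x, \tilde{x})| \, dx, \qquad (2.9) $$

where the integral is computed over all $x \in \mathcal{X}$ and the covariance
matrix $C_q(x, \tilde{x})$ is defined as

$$ C_q(x, \tilde{x}) = \begin{bmatrix} C(\mathcal{D}) & k(\tilde{x}) & k(x) \\ k(\tilde{x})^T & \tilde{\kappa} & Q(x, \tilde{x}) \\ k(x)^T & Q(x, \tilde{x}) & \kappa \end{bmatrix} \qquad (2.10) $$

and $\tilde{\kappa} = Q(\tilde{x}, \tilde{x}) + \sigma$. Here,
$C(\mathcal{D})$ is an $M \times M$ matrix and $C_q$ is an
$(M + 2) \times (M + 2)$ one, whereas $\kappa$ and $Q(x, \tilde{x})$ are
scalars and $k$ is an $M \times 1$ vector. This result from [2] is summarized
in the following proposition.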

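Proposition 1. As a maximum information data collection strategy for a
Gaussian Process with a covariance matrix $C(\mathcal{D})$, the next
observation $\tilde{x}$ should be chosen in such a way that

$$ \tilde{x}^* = \arg\min_{\tilde{x}} \int_{x \in \mathcal{X}} \frac{1}{2} \ln |C_q(x, \tilde{x})| \, dx, $$

where $C_q(x, \tilde{x})$ is defined in (2.10).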

An Approximate Solution to Information Maximization 

When making a decision on the next action through multi-objective
optimization, there are (infinitely) many candidate points. A pragmatic
solution to the problem of finding solution candidates is to (adaptively)
sample the problem domain $\mathcal{X}$ to obtain the set

$$ \Theta := \{ x_1, \ldots, x_T : x_i \in \mathcal{X}, \ x_i \notin \mathcal{D}, \ \forall i \} $$

that does not overlap with known points. In low (one or two) dimensions,
this can be easily achieved through grid sampling methods. In higher
dimensions, (Quasi) Monte Carlo schemes can be utilized. For large problem
domains, the current domain of interest $\mathcal{X}$ can be defined around
the last or most promising observation in such a way that such a sampling is
computationally feasible. Likewise, multi-resolution schemes can also be
deployed to increase computational efficiency.

Given a set of (candidate) points $\Theta$ sampled from $\mathcal{X}$, the
result in Proposition 1 can be revisited. The problem in (2.9) is then
approximated [15] by

$$ \tilde{x}^* = \arg\min_{\tilde{x} \in \Theta} \sum_{x \in \Theta} |C_q(x, \tilde{x})| \qquad (2.11) $$

using the monotonicity property of the natural logarithm and the fact that
the determinant of a covariance matrix is non-negative. Thus, the following
counterpart of Proposition 1 is obtained:

Proposition 2. As an approximately maximum information data collection
strategy for a Gaussian Process with a covariance matrix $C(\mathcal{D})$ and
given a collection of candidate points $\Theta$, the next observation
$\tilde{x} \in \Theta$ should be chosen in such a way that

$$ \tilde{x}^* = \arg\min_{\tilde{x} \in \Theta} \sum_{x \in \Theta} |C_q(x, \tilde{x})|, $$

where $C_q(x, \tilde{x})$ is given in (2.10).

Although it is an approximation, finding a solution to the optimization
problem in Proposition 2 can still be computationally costly. Therefore,
a greedy algorithm is proposed as a computationally simpler alternative.
Choosing the maximum variance $\tilde{x}$ as

$$ \tilde{x}^* = \arg\max_{\tilde{x} \in \Theta} v(\tilde{x}) $$

leads to a large (possibly largest) reduction in $|C_q|$ and hence
provides a rough approximate solution to (2.11) and to the result in
Proposition 1. This result from [2] is consistent with widely known heuristics
such as "maximum entropy" or "minimum variance" methods [14], and a variant
has been discussed in [9].

Proposition 3. Given a Gaussian Process with a covariance matrix
$C(\mathcal{D})$ and a collection of candidate points $\Theta$, an approximate
solution to the maximum information data collection problem defined in
Proposition 1 is to choose the sample point(s) $\tilde{x}$ in such a way that
it has (they have) the maximum variance within the set $\Theta$.
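
The greedy rule of Proposition 3 can be illustrated with a minimal sketch. To keep it short, the example assumes a training set with a single scalar input, for which the variance in (2.5) has a closed form; the function names, the grid, the observed input 0.3, and the noise variance 0.01 are all hypothetical choices made for illustration.

```python
import math

def kernel(x, z):
    """Squared-exponential kernel (2.2) for scalar inputs."""
    return math.exp(-0.5 * (x - z) ** 2)

def variance_single_point(x, x1, noise):
    """GP variance (2.5) when the training set holds the single input x1 (M = 1)."""
    kappa = kernel(x, x) + noise
    k = kernel(x1, x)
    c = kernel(x1, x1) + noise          # the 1 x 1 covariance matrix C
    return kappa - k * k / c            # kappa - k^T C^{-1} k

def pick_max_variance(candidates, variance):
    """Greedy choice: the candidate with the largest predictive variance."""
    return max(candidates, key=variance)

# Candidate set: a grid on [0, 1] that excludes the already observed input 0.3.
candidates = [i / 10 for i in range(11) if i != 3]
x_next = pick_max_variance(
    candidates, lambda x: variance_single_point(x, x1=0.3, noise=0.01))
```

With a single observation at 0.3, the variance grows with the distance from 0.3, so the rule selects the grid point farthest from the known data.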

3 Dual Control with Limited Information 

Consider a nonlinear discrete-time representation of a dynamical system that
evolves on a $d$-dimensional state space $\mathcal{X} \subset \mathbb{R}^d$,
steered by control actions chosen from an $e$-dimensional space
$\mathcal{U} \subset \mathbb{R}^e$. Usually, the dimension of the control
space is smaller than the state one, $e < d$. It is assumed here for
simplicity that both control and state spaces are nonempty, convex, and
compact. The system states evolve according to

$$ x_i(t+1) = f_i(x(t), u(t)), \quad i = 1, \ldots, d, \qquad (3.1) $$

where $x(t) \in \mathcal{X}$, $x_i(t)$ is a scalar, $t = 1, \ldots$ denotes
discrete time instances, and each
$f_i : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$ is a possibly nonlinear
function. States of dynamical systems are, however, often not observable.
Therefore, define a mapping from the states to observable quantities $y$ as

$$ y_j(t) = g_j(x(t)), \quad j = 1, \ldots, \bar{d}, \qquad (3.2) $$

where each $g_j : \mathcal{X} \to \mathbb{R}$ is possibly a nonlinear
function, and $\bar{d} \le d$.

If nothing is known about the dynamic system defined by (3.1)-(3.2) in
the beginning, and there is no observation or system noise, then the system
can be simplified to its input-output relationship:

$$ y_j(t+1) = g_j(f[g^{-1}(y(t)), u(t)]) \;\Rightarrow\; y_j(t+1) = h_j(y(t), u(t)), \quad j = 1, \ldots, \bar{d}, \qquad (3.3) $$

where each $h_j : \mathbb{R}^{\bar{d}} \times \mathcal{U} \to \mathbb{R}$ is
possibly a nonlinear function. As a simplification, system and observation
noise can be modeled as zero-mean Gaussian.³ Thus, a noisy variant of system
(3.3) is

$$ y_j(t+1) = h_j(y(t), u(t)) + n(t), \quad j = 1, \ldots, \bar{d}, \qquad (3.4) $$

where $n(t) \sim \mathcal{N}(0, \sigma)$ and $\sigma$ is the respective noise
variance.

3.1 Problem Formulation 

The dual control problem is defined as follows. Consider an unknown non- 
linear discrete-time dynamic system, which has a control input and a (par- 
tially) observable output that is possibly distorted by noise. The control 
input may affect the system linearly, which leads to a simpler problem, or 
its effect may be nonlinear and unknown to the decision maker. The ob- 
jective of the decision maker is to control the system in such a way that it 
follows a given reference signal. Each action taken is assumed to be very 
costly and the decision maker may only have limited time to satisfy the dual
goals of identification and control. What is the best strategy to address this
problem?

Based on the discussion above, the described problem can be formulated
more concretely. Let $r(t) \in \mathbb{R}^{\bar{d}}$ denote the
$\bar{d}$-dimensional reference signal. The discrete-time nonlinear system can
be modeled using (3.4), where $y(t)$ is the output, $u(t)$ is the control
action, and $n(t)$ is the observation noise at time $t$. Then, the following
dual control problem is formulated.

³ Biased Gaussian noise can be easily handled by GPs by introducing a mean
function, which we omit in this paper for simplicity.

Problem 1. [Dual Control under Limited Information] Let a discrete-time
system be described by the following input-output relationship

$$ y_j(t+1) = h_j(y(t), u(t)) + n(t), \quad j = 1, \ldots, \bar{d}, $$

where $y(t)$ is the $\bar{d}$-dimensional output, $u(t)$ is the
$e$-dimensional control action, and $n(t) \sim \mathcal{N}(0, \sigma)$ is a
zero-mean Gaussian observation noise with variance $\sigma$ at time $t$. The
function $h_j : \mathbb{R}^{\bar{d}} \times \mathcal{U} \to \mathbb{R}$ is
possibly nonlinear for all $j$. Given a $\bar{d}$-dimensional desired
reference signal $r(t)$, what is the best control strategy (series of control
actions) $\mu(t)$ such that

$$ \mu(t) = \arg\min_{u(t)} \| y(t) - r(t) \|, \quad \forall t = 1, \ldots, $$

where $\| \cdot \|$ is a norm quantifying the mismatch between the observed
and desired outputs?

If there were more information on the system available or more time for
experimentation, one could have resorted to the rich literature on adaptive
and robust control to find a solution. However, Problem 1 differs from
the ones in the classical adaptive and robust control literature in
that the decision maker starts with zero or very little prior information and
a solution has to be found online while learning the system. This puts special
emphasis on observations and quantifying information using the methods
described in Section 2.2.

Using GP regression for estimating the system dynamics in (3.4) and
Shannon information theory to measure and maximize the amount of information
obtained with each observation, a model-based variant of Problem 1
is defined.

Problem 2. [Model-based Control under Limited Information] Let a discrete
time dynamic system be described by the following input-output relationship

$$ y_j(t+1) = h_j(y(t), u(t)) + n(t), \quad j = 1, \ldots, \bar{d}, $$

where $y(t)$ is the $\bar{d}$-dimensional output, $u(t)$ is the
$e$-dimensional control action, and $n(t) \sim \mathcal{N}(0, \sigma)$ is a
zero-mean Gaussian observation noise with variance $\sigma$ at time $t$. The
function $h_j : \mathbb{R}^{\bar{d}} \times \mathcal{U} \to \mathbb{R}$ is
possibly nonlinear for all $j$. The goal is to control the system in such a
way that the output $y(t)$ follows a given $\bar{d}$-dimensional reference
signal $r(t)$.

Let $\hat{h}$ be an estimate of system dynamics $h$ based on an a priori model
and a set of observations. What is the best control strategy (series of
control actions) $\mu(t)$ that solves the multi-objective problem with the
following components?

• Objective 1: $\min_{u \in \mathcal{U}} \| y(t) - r(t) \|$

• Objective 2: $\max_{u \in \mathcal{U}} \mathcal{I}(\hat{h}, u(t))$

The main (first) objective of Problem 2 is naturally the same as the one
of Problem 1. The second objective states the "exploration" or information
collection aspect.

As a side note, unlike the static optimization problem in [2], how closely
the estimated system dynamics $\hat{h}$ approximate the original ones is not
set as an objective. The reason behind this is the fact that the data points
used for identifying $\hat{h}$ can only be selected indirectly through control
actions $u$. Therefore, a reasonably complete identification of the system
dynamics may be too costly. A partial identification relevant to the main
objective is sufficient for the purpose here.



3.2 Solution Approach 

The solution approach to Problem 2 utilizes the methodology in Section 2.
The GP variance maximization approximates here the information maximization
objective. A (random or grid-based) sampling scheme is adopted
again for evaluating candidate solutions, in this case combinations of the
observed current state and available control actions. A weighted-sum scheme
is utilized to combine the two objectives in Problem 2. A visual depiction
of the control framework is shown in Figure 2.

Figure 2: The dual control framework for identifying and controlling an
unknown dynamic system with limited information.

Since the problem is by its very nature iterative, the best control strategy
has to be evaluated at the current state, taking into account newly received
information and using the latest update of estimated system dynamics. As
a starting point, a gradient or greedy algorithm is proposed which aims to
balance both exploration and exploitation objectives.

Proposition 4. Let a discrete time dynamic system be described by the
following input-output relationship

$$ y_j(t+1) = h_j(y(t), u(t)) + n(t), \quad y(t) \in \mathbb{R}^{\bar{d}}, \ u(t) \in \mathcal{U}, $$

$j = 1, \ldots, \bar{d}$, where $n(t) \sim \mathcal{N}(0, \sigma)$ is a
zero-mean Gaussian observation noise with variance $\sigma$. Further let
$\Phi$ be a grid-based or randomly sampled set of available control actions
$u$ from the control space $\mathcal{U}$. Given a reference signal
$r(t) \in \mathbb{R}^{\bar{d}}$, define the optimization problem

$$ \min_{u(t) \in \Phi} J(u, y, r, w) = \min_{u(t) \in \Phi} \; w_1 \| \hat{y}(t+1) - r(t) \| - w_2 \, v(\hat{y}(t+1), u(t)), \qquad (3.5) $$

where

$$ \hat{y}_j(t+1) = \hat{h}_j(y(t), u(t)) + n(t) $$

is the next estimated output using a GP based on control $u(t)$, and
$v(\hat{y}(t+1), u(t))$ is the variance of the associated Gaussian as defined
in (2.5). The solution to this problem

$$ \mu(t) = \arg\min_{u(t) \in \Phi} J(u, y, r, w), \quad t = 1, \ldots $$

approximates the best control strategy under limited information, and hence
approximately solves Problem 2.

A couple of remarks should be made at this point regarding the solution
approach presented. Firstly, the approach in Proposition 4 constitutes a
greedy one, which aims to solve the problem in the shortest time based on
available information and goes in the direction of the steepest gradient (here
of the weighted sum of objectives). The main concern here is whether such
an algorithm gets stuck in a local minimum. This issue can be remedied
at least partially by placing a higher weight on the information collection
objective. Secondly, it is implicitly assumed here that the system at hand
is at least partially observable and controllable. It is naturally difficult, if 
not impossible, to check such properties of an unknown system. Thus, the 
approach here can be interpreted also as a "best effort" one, which aims to 
achieve the best performance possible given controllability and observability 
limitations. 

A summary of the solution approach discussed above for a specific set of
choices is provided by Algorithm 1.

Algorithm 1 Dual Control with Limited Information

Input: Problem domain, GP meta-parameters, objective weights $[w_1, w_2]$,
initial data set $\mathcal{D}$, reference signal $r$, control actions $\Phi$.
while system is (partially) observable and controllable do
    Estimate the system dynamics (I/O function) $\hat{h}$ using GP.
    Compute the best control action $u \in \Phi$ solving (3.5).
    Compute variance, $v(y, u)$, of $\hat{h}$ as an estimate of $\mathcal{I}(\hat{h})$.
    Update the data set $\mathcal{D}$ using newly observed data point $y$.
end while
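
The loop of Algorithm 1 with the weighted-sum objective (3.5) can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the scalar plant `toy_plant` is hypothetical and bounded by construction, the norm in (3.5) is the absolute value, the GP input is the pair (y, u), the noise variance is 0.01, the action grid has 21 points, and before any data arrive the GP falls back to its zero-mean prior. All function names are illustrative.

```python
import math

def kernel(a, b):
    """Squared-exponential kernel (2.2) on input pairs (y, u)."""
    return math.exp(-0.5 * sum((p - q) ** 2 for p, q in zip(a, b)))

def invert(A):
    """Invert a symmetric positive definite matrix by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def fit_gp(inputs, outputs, noise):
    """Return predict(z) -> (mean, variance) following (2.5)."""
    if not inputs:  # no data yet: zero-mean prior with variance kappa
        return lambda z: (0.0, kernel(z, z) + noise)
    m = len(inputs)
    C = [[kernel(inputs[i], inputs[j]) + (noise if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    Cinv = invert(C)
    alpha = [sum(Cinv[i][j] * outputs[j] for j in range(m)) for i in range(m)]

    def predict(z):
        k = [kernel(x, z) for x in inputs]
        mean = sum(ki * ai for ki, ai in zip(k, alpha))
        quad = sum(k[i] * Cinv[i][j] * k[j] for i in range(m) for j in range(m))
        return mean, kernel(z, z) + noise - quad
    return predict

def dual_control(plant, actions, r, y0, steps, w1=1.0, w2=1.0, noise=0.01):
    """Observe, update the GP, and pick the next action, as in Algorithm 1."""
    inputs, outputs, history, y = [], [], [], y0
    for _ in range(steps):
        predict = fit_gp(inputs, outputs, noise)   # estimate h with the GP

        def cost(u):                               # weighted sum (3.5)
            mean, var = predict((y, u))
            return w1 * abs(mean - r) - w2 * var

        u = min(actions, key=cost)                 # best action in the sampled set
        y_next = plant(y, u)                       # apply the action, observe output
        inputs.append((y, u))                      # update the data set
        outputs.append(y_next)
        history.append((y, u, y_next))
        y = y_next
    return history

def toy_plant(y, u):
    """Hypothetical bounded plant, used only to exercise the loop."""
    return 0.5 * math.sin(3.0 * y) + u

actions = [round(-1 + 0.1 * i, 1) for i in range(21)]
history = dual_control(toy_plant, actions, r=0.8, y0=0.1, steps=15)
```

In the first step no data are available, so every action has the same cost and the first element of the action set is chosen; from then on the variance term rewards actions whose outcome is still uncertain, while the first term rewards actions predicted to land near the reference.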

4 Examples 

4.1 Dual Control of Logistic Map 

The logistic map

$$ x(n+1) = r \, x(n) (1 - x(n)), $$

parameterized by the scalar $r$, is a well-known one-dimensional discrete-time
nonlinear system, where $n$ denotes the time step or iteration. It is chosen as
an illustrative example due to its interesting properties and for visualization
purposes. For $r = 3.5$, the logistic map converges to a limit cycle while it
exhibits chaotic behavior for $r = 3.8$, as shown in Figure 3.



Figure 3: Example trajectories of the logistic map for r = 3.5 and r = 3.8.
(Plot title: Trajectory of Logistic Map; horizontal axis: time steps.)
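
The two regimes can be reproduced with a few lines of Python. The initial value 0.1 and the trajectory length of 1000 steps are assumptions made for this sketch; the initial condition used for Figure 3 is not stated in the text.

```python
def logistic_trajectory(r, x0, steps):
    """Iterate x(n+1) = r * x(n) * (1 - x(n)) and return [x(0), ..., x(steps)]."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

periodic = logistic_trajectory(3.5, 0.1, 1000)   # settles on a limit cycle
chaotic = logistic_trajectory(3.8, 0.1, 1000)    # chaotic, but stays in [0, 1]

# After the transient, the r = 3.5 trajectory repeats every four steps.
tail = periodic[-8:]
```

For $r = 3.5$ the last eight values consist of the same four numbers visited twice, which is the limit cycle mentioned above; for $r = 3.8$ no such repetition appears, although the trajectory remains within the unit interval.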

Linear Control:

First, the logistic map is controlled with additive actions while being
identified using the GP method described in Algorithm 1:

$$ x(n+1) = r \, x(n) (1 - x(n)) + u(n). $$

The controller knows here that the control is linear (additive), and utilizes
this extra knowledge in identifying the system, which simplifies the problem
significantly. The system description (input-output relationship) from the
perspective of the controller is:

$$ y(n+1) = h(y(n)) + u(n). $$

The control actions are taken from the finite set

$$ \Phi = \{ u_i \in [-1, 1] : u_{i+1} = u_i + 0.02, \ i = 1, \ldots, 101 \}. $$

The kernel variance is 0.5 and the weights in the objective function (3.5)
are chosen as $w_1 = w_2 = 1$. The goal is to stabilize the system at
$x^* = 0.8$, which constitutes the constant reference signal. The starting
point is $x_0 = 0.1$. The control actions and state estimation errors over
time (in each step based on arrived data points) for $r = 3.5$ and the
corresponding trajectory of the logistic map are depicted in Figures 4 and 5.
Note that, in this case the logistic map acts only as a nonlinear system with
a limit cycle rather than behaving chaotically. It is observed that
approximately the first 10 steps are used by the algorithm to explore or learn
the system, after which the trajectory approaches the target. Figure 6 shows
the estimated function versus the original mapping for u = as well as one
standard deviation from the estimated value. It can be seen that the variance
is minimum, i.e. the estimate is best, around the target value.

The same numerical analysis is repeated for r = 3.8, in which case the
logistic map behaves chaotically and the task turns from control of an
unknown nonlinear system to control of an unknown chaotic system. In this
case, the goal is again to stabilize the system at x* = 0.8. The control
actions and state estimation errors over time (computed at each step from
the data points collected so far) for r = 3.8 and the corresponding
trajectory of the logistic map are depicted in Figures 7 and 8. Note that
the learning process takes longer in this case, possibly due to the chaotic
(complex) behavior of the system. Figure 9 shows the estimated function
versus the original mapping for u = 1.5.

[Plot: Control Actions and Estimation Error; horizontal axis: Time steps]



Figure 4: The control actions and state estimation errors for logistic map 
with r = 3.5 and linear control. 



[Plot: Trajectory of the Controlled System]

Figure 5: The controlled trajectory of the logistic map for r = 3.5 and linear
control.

[Plot: Actual versus Learned Mapping]




Figure 6: The logistic map and its estimate along with one standard devia- 
tion for u = and r = 3.5 after 100 iterations (data points). 



[Plot: Control Actions and Estimation Error]

Figure 7: The control actions and state estimation errors for logistic map
with r = 3.8 and linear control.

[Plot: Trajectory of the Controlled System; horizontal axis: Time steps]

Figure 8: The controlled trajectory of the logistic map for r = 3.8 and linear 
control. 




Figure 9: The logistic map and its estimate along with one standard devia-
tion for u = 1.5 and r = 3.8 after 100 iterations (data points).

Nonlinear and Unknown Control:

Next, the logistic map is controlled with actions that affect the system
nonlinearly in a way that is unknown to the controller:

\[ x(n+1) = 3.8 x(n) (1 - x(n)) + \cos(u(n)). \]

The system description (input-output relationship) from the perspective of
the controller is:

\[ y(n+1) = h(y(n), u(n)). \]

Compared to the linear and known control case, this problem is much harder
to address, since the effect of the control action must now be learned as
well. The control actions are taken from the finite set

\[ \mathcal{U} = \{ u_i \in [0, \pi] : u_{i+1} = u_i + 0.1, \; i = 1, \ldots, 32 \}. \]

The weights in the objective function (3.5) are chosen initially as w_1 = 1
and w_2 = to emphasize exploration in the beginning, but w_2 is increased
gradually to w_2 = 40 to achieve as good a control performance as possible.

Figures 10, 11, and 12 summarize the obtained results. Since the objective
of Algorithm 1 is not only to learn the entire system behavior but also to
achieve the control target in a greedy manner, the system is estimated
accurately only around the target value. It is observed that the learning
process takes twice as long as in the linear control case and that the
control actions are less accurate. It should be kept in mind, however, that
concurrently identifying and adaptively controlling a chaotic system with
limited information is not an easy task.

[Plot: Control Actions and Estimation Error]

Figure 10: The control actions and state estimation errors for logistic map
with r = 3.8 and nonlinear control.

[Plot: Trajectory of the Controlled System]

Figure 11: The controlled trajectory of the logistic map for r = 3.8 and
nonlinear control.

[Plot: Actual versus Learned Mapping]

Figure 12: The logistic map and its estimate along with one standard
deviation for u = 1.5 and r = 3.8 after 100 iterations (data points).

4.2 Position Control of a Cart with Inverted Pendulum

The inverted pendulum on a cart is a classic example system for control
problems. In this case, the problem is formulated as the position control of
the cart with the inverted pendulum, which is defined by the following set
of discrete-time nonlinear state-space equations [19], [18]:

\begin{align}
x_1(n+1) &= x_1(n) + T x_2(n), \tag{4.1} \\
x_2(n+1) &= x_2(n) + \frac{T}{M + m \sin^2(x_3(n))} \Big[ u(n) + m L x_4^2(n) \sin(x_3(n)) - b x_2(n) - m g \cos(x_3(n)) \sin(x_3(n)) \Big], \tag{4.2} \\
x_3(n+1) &= x_3(n) + T x_4(n), \tag{4.3} \\
x_4(n+1) &= x_4(n) + \frac{T}{L \left( M + m \sin^2(x_3(n)) \right)} \Big[ -u(n) \cos(x_3(n)) + (M + m) g \sin(x_3(n)) + b x_2(n) \cos(x_3(n)) - m L x_4^2(n) \cos(x_3(n)) \sin(x_3(n)) \Big], \tag{4.4} \\
y(n) &= x_1(n), \tag{4.5}
\end{align}

where T = 0.05 is the sampling period, y = x_1 is the position of the cart,
x_2 = dx/dt is the cart velocity, x_3 = θ is the inverted pendulum angle, and
x_4 = dθ/dt is the angular velocity. The parameter values are: b = 12.98,
M = 1.378, L = 0.325, g = 9.8, and m = 0.051. Further details on this
standard model are available in [19], [18].

First, the cart is controlled using a one-step look-ahead strategy with
full knowledge from the starting point x = [0, 0, 0.6, 0] with control
actions chosen from the set

\[ \{ u_i \in [-10, 10] : u_{i+1} = u_i + 1, \; i = 1, \ldots, 21 \}. \]

The objective is to fix the position of the cart to y* = 0.5. The weights in
the objective function (3.5) are w_1 = 1 and w_2 = 20. The results of this
case, shown in Figures 13 and 14, provide a benchmark to compare against.

[Plot: Control Actions]

Figure 13: The control actions for the cart with inverted pendulum, single-
step look ahead, and full knowledge.

[Plot: Position and Velocity of Cart]

Figure 14: The trajectory of the cart with inverted pendulum controlled
with full knowledge and single-step look ahead.

Next, the cart is controlled using a one-step look-ahead strategy as a
black-box system, y(n+1) = h(y(n), u(n)). As side information, the
controller knows (4.1) but has to estimate (4.2), while (4.3) and (4.4)
effectively act as external/unmodeled dynamics. The kernel and noise
variance in GP are chosen as 0.5 and 0.01, respectively. The results
obtained using Algorithm 1 are shown in Figures 15 and 16. The performance
is satisfactory considering that the trajectory comes within 10% of the
target in 30 steps.
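
For experimentation, the state equations (4.1)-(4.5) can be coded directly. The following is a minimal sketch in Python, assuming the standard cart-pendulum form of these equations with the parameter values quoted in this section; the function names `pendulum_step` and `output`, the constant names, and the zero-input usage example are illustrative assumptions and not part of the paper's algorithm.

```python
import math

# Parameter values quoted in the text.
T = 0.05         # sampling period
B = 12.98        # friction coefficient b
M_CART = 1.378   # M
L_ROD = 0.325    # L
G = 9.8          # g
M_PEND = 0.051   # m


def pendulum_step(x, u):
    """One discrete-time step; x = [position, velocity, angle, angular velocity]."""
    x1, x2, x3, x4 = x
    s, c = math.sin(x3), math.cos(x3)
    denom = M_CART + M_PEND * s * s
    # Bracketed term of (4.2) divided by M + m sin^2(x3).
    cart_acc = (u + M_PEND * L_ROD * x4 ** 2 * s - B * x2 - M_PEND * G * c * s) / denom
    # Bracketed term of (4.4) divided by L (M + m sin^2(x3)).
    ang_acc = (-u * c + (M_CART + M_PEND) * G * s + B * x2 * c
               - M_PEND * L_ROD * x4 ** 2 * c * s) / (L_ROD * denom)
    return [x1 + T * x2, x2 + T * cart_acc, x3 + T * x4, x4 + T * ang_acc]


def output(x):
    """Measured output (4.5): the cart position."""
    return x[0]


# The 21 candidate actions -10, -9, ..., 10 and the stated starting point.
ACTIONS = list(range(-10, 11))
x0 = [0.0, 0.0, 0.6, 0.0]
x_next = pendulum_step(x0, 0.0)  # hypothetical step with zero input
```

Note that the position x_1(n+1) in (4.1) does not depend on u(n) directly; the action influences the output only through the velocity update, which is why the velocity dynamics are the part worth estimating in the black-box setting.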

[Plot: Control Actions]

Figure 15: Dual control of the cart with inverted pendulum and single-step
look ahead.

Figure 16: The trajectory of the cart with inverted pendulum under dual
control with single-step look ahead estimates.


5 Literature Review

The book [10] provides important and valuable insights into the
relationship between information theory, inference, and learning, where
measuring information content of data points using Shannon information is
discussed. However, focusing mainly on more traditional coding,
communication, and machine learning topics, the book does not discuss the
type of control problems presented in this paper.

Learning plays an important role in the presented framework, especially
regression, which is a classical machine (or statistical) learning method. A
very good introduction to the subject can be found in [3]. A complementary
and detailed discussion on kernel methods is in [12]. Another relevant
topic is Bayesian inference [17], [10], which is at the foundation of the
presented framework. In the machine learning literature, Gaussian processes
(GPs) are becoming increasingly popular due to their various favorable
characteristics. The book [11] presents a comprehensive treatment of GPs.
Additional relevant works on the subject include [10], [12], E], which also
discuss GP regression.

Gaussian processes have recently been applied to the area of optimization
and regression [4] as well as system identification. While the latter
mentions active learning [TJ], neither work discusses explicit information
quantification or builds a connection with Shannon information theory.
Using GP for system identification is discussed again in ff], yet again
without information collection aspects. The paper [9] discusses, in a static
optimization setting, objective functions which measure the expected
informativeness of candidate measurements within a Bayesian learning
framework. The subsequent study [13] investigates active learning for GP
regression in machine learning applications using variance as a (heuristic)
confidence measure for test point rejection.

Dual control is an old topic, which attracted the interest of the research
community in the second half of the last century [20]. The article [21]
revisits this subject and incorporates information explicitly into the
dual control problem, but focuses on estimation of parameters in a known,
linear system. Adopting a different perspective, a dynamic programming
approach was presented recently in [5], which develops an approximate
value-function-based reinforcement learning algorithm using GPs together
with its online variant. An application of GP-based identification and
control to an autonomous blimp is discussed in [6].

6 Conclusion 

The dual control approach presented in this paper focuses on
black-box control with very limited information. The information acquired
at each control step is quantified using the entropy measure from informa- 
tion theory and serves as the training input to a state-of-the-art Gaussian 
process regression (Bayesian learning) method. The quantification of the 
information obtained from each data point allows for iterative and joint 
optimization of both identification and control objectives. The results ob- 
tained from two illustrative examples, control of logistic map as a chaotic 
system and position control of a cart with inverted pendulum, demonstrate 
the developed approach. 

The dynamic control problem in this paper differs from the static
optimization analysis in [2] in multiple ways. One of the main differences is the
fact that the system states are now influenced indirectly through control ac- 
tions. The data points used for identifying the underlying system mapping 
can only be selected indirectly (unlike static optimization) and under the 
constraints imposed by the nature of the "control" in the dynamic system at 
hand. 

The presented results should be considered mainly as an initial step. 
Future research directions are abundant and include further investigation 
of the exploration-exploitation trade-off, more elaborate adaptive weight- 
ing parameters, and random sampling methods for problems in higher di- 
mensional spaces. Applications to multi-person decision-making and game 
theory constitute another interesting future research topic. 